-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 64 #52
Issue 64 #52
Conversation
Rationale The problem highlighted by Kik revealed a vulnerable condition which allows the mining activity to be carried out bypassing, almost completely, the characteristics of memory-hardness embedded into the algo. Must be evidenced that this issue could be exploited only in bundle with a custom node implementation and the current "public" mining infrastructure does not allow the threat vector to go through. Nevertheless the concern is present and the code provided by Kik has proven real. Kik's discovery relies on two basic assumptions : - each block's header_hash is modifiable along with the providing of an according value for field "extraData" - due to above mining can be carried out bypassing, almost completely, DAG memory accesses and "transforming" in a "goal seeking" only on header_hash using high Keccak computational power. As additional consideration a "custom node" must be modifed to accept a mining result from work consumer with a modified header_hash (see below) The two above assumptions are valid for **both ethash and progpow** algorithms even if the latter, due to different width of "seed" is way more affected. Both ethash and progpow share the same "macro" steps with a difference on step 3: - input data (header_hash + nonce) go through a Keccak round which produces a "seed" (256 bits for ethash, 64 bits for ProgPoW) - "seed" drives the access to DAG memory locations which eventually produce a "mix_hash" (256 bits for both algos) - In **ethash** last round of Keccak uses as input : 16 words as carry-over from previous keccak round + 8 words mix_hash; in **progpow** the input is 8 words from header_hash + "seed" + 8 words from mix_hash The logic concatenation of operations allows **in ProgPoW** (Kik's code) to: 1 skip the first step and directly set an arbitrary value for "seed" 2 obtain mix_hash (doing only one round of memory accesses) 3 linearly iterate through "extraData" increments and build a `new_header_hash` (Keccak256) starting from the header_hash sent by work provider 4 Apply new_header_hash + seed + mix as input to last keccak round and compare first two words of digest result to target. If matching go ahead otherwise goto 3 5 We now have a new_header_hash and an "extraData" solving the algo. One thing left to do is to find a **nonce** which, combined with new_header in the initial Keccak round, produces a seed (and by consequence a mix_hash) which is equal to the one arbitrary set 6 To do so simply scan nonce range (2**64) and do as many Keccak rounds as possible to find the right one. Should range be exhausted goto 4 7 Eventually the miner will notify the "custom node" that the hashing has found a solution (nonce) changing header_hash due to extraData. The difference in width for the seed and the different absorb phase for last keccak round makes the issue orders of magnitude more relevant in ProgPoW than in ethash but nevertheless the same pattern apply. It goes without saying this method is highly impractical and economically unviable on ethash as of tiday ... but is there. Basically in both algos we have : - seed == F(header_hash, nonce) - mix == F(seed) // where memory access occurr - final == F(seed, mix) This means also that setting an arbitrary seed we get - mix == F(seed) // where memory access occurr - final == F(seed, mix) // where I find a "new_header" - nonce == F(new_header, seed) // which can be iterated very quickly Thus making possible to brute force keccak while bypassing memory accesses Purpose of this PR is to extend the dependency from header_hash to mix too so having - seed == F(header_hash, nonce) - mix == F(header_hash) // where memory access occurr - final == F(seed, mix) Thus having **both** mix and seed being dependant on header_hash thus making impossible to goal seek a seed without having to also change the mix.
@solardiz - Would appreciate your eyes on this. @AndreaLanfranchi - Will review shortly. Are we able to get resulting profiler results for both AMD (perhaps we'll need assistance from a community member) as well as NVIDIA? We will also need to ensure this is also applied to libprogpow and libethash-cl. |
plus amend compile errors
bits consumed in fill_mix are only initiators to KIS99
Benchmarking on Windows EVGA Gtx 1060 3Gb 0.9.3.
0.9.4 (proposed)
|
Hi all. I am impressed by how quickly all of you jumped in to try and fix this. That's great! I will probably have more comments, but for starters and most relevant to your current experiments: somehow in my experiments I was able to double the number of "ProgPoW VMs" registers and when done along with doubling the block size (from 256 to 512 bytes) this had very little performance impact - much less than we're seeing above. So maybe it's the opportunity to revisit that and try to fix the new issue on top of those changes, so that the revised ProgPoW would also require more registers in custom hardware? |
Looks like a 9.78% hit on hashrate? |
@AndreaLanfranchi Can you try settings like what I had in my 0.9.3m1 proposal? -
Due to merging of #35 back then, this should just work. In 0.9.3m1+, I also had a code tweak to use groups of 4 sequential 512-byte blocks each, thus random reads of 2048-byte blocks, which helped Maxwell a lot. I guess that isn't essential for your testing on Pascal and newer, but is something we can try re-adding as well. In the above, I set REGS to 64, but with more register pressure from the fix here I guess you can reasonably try e.g. 56 or 48 as well. |
The tweak I had proposed in #40 (with two versions of code changes that are ready for use) might also help regain some of the performance. |
@solardiz 0.9.3m1
Worth mention that those tweaks have no effect at all in mitigation of #51 |
@AndreaLanfranchi Is this the 0.9.3m1 settings (but not its code tweaks) + your code tweaks proposed here? In other words, a slight speedup over what you had? If so, can you also try REGS 56, 48, or such? Thanks! |
No. This is knob tweaks in third column applied over 0.9.3. |
@AndreaLanfranchi Hmm, that's a bigger slowdown than I expected. Maybe the move to 2048 byte blocks is needed to reduce the cost of those changes on Pascal as well. Can you try my final code from https://github.com/solardiz/ProgPOW (known as 0.9.3m2) on its own and along with your changes? Thanks! |
@solardiz I will as soon as I have a chance. You have my word. :) This issue though is not related to hashrate chasing rather to void the "bypass memory hardness" issue. Would like all of us to focus on that, |
@AndreaLanfranchi Of course, we're primarily fixing the vulnerability here. We just want to do so with minimum impact on hashrate, and maybe also use the opportunity to merge other previously postponed tweaks that make better use of GPUs vs. specialized ASICs. Also, I imagine it should be possible to write the code such that we explicitly spill the now-larger intermediate hash value to memory (maybe to shared aka local? or even to global+cache, which is probably better than having the compiler spill something more frequently used), and then load it back from memory when it's needed, so that there's no increase in register pressure during the main loop. |
@solardiz this proposal (CUDA only yet) increase registers use from 56 to 72 and no spills. |
@AndreaLanfranchi OK, I probably need to actually review/try it before speaking, but since I already started: from the source, it looks like the increase in register pressure inside the per-lane loop is only slight, by maybe 3 registers for |
Inside the per-lane loop there is no increase in register usage. |
@AndreaLanfranchi What block number are your benchmark results above for? I'd probably use the same for mine. |
For me, the slowdown with these changes (as-is, except for the Hunter/Boost build fixes) on GTX 1080 for block 10M appears to be 3.1% using the Trial 10 speeds below: @AndreaLanfranchi's master branch (cbd1c16):
@AndreaLanfranchi's Issue-64 branch (04f4bee):
|
For me, the below hack reduces the register usage from 72 to 64 (original was 63 here) and completely removes the performance impact on the GTX 1080 here: +++ b/libethash-cuda/CUDAMiner_kernel.cu
@@ -159,12 +159,14 @@ progpow_search(
// Force threads to sync and ensure shared mem is in sync
__syncthreads();
- uint32_t state[25]; // Keccak's state
uint32_t hash_seed[2]; // KISS99 initiator
hash32_t digest; // Carry-over from mix output
+ uint32_t state8[8];
+{
// Absorb phase for initial round of keccak
// 1st fill with header data (8 words)
+ uint32_t state[25]; // Keccak's state
for (int i = 0; i < 8; i++)
state[i] = header.uint32s[i];
// 2nd fill with nonce (2 words)
@@ -177,6 +179,10 @@ progpow_search(
// Run intial keccak round
keccak_f800(state);
+ for (int i = 0; i < 8; i++)
+ state8[i] = state[i];
+}
+
// Main loop
#pragma unroll 1
for (uint32_t h = 0; h < PROGPOW_LANES; h++)
@@ -184,8 +190,8 @@ progpow_search(
uint32_t mix[PROGPOW_REGS];
// share the first two words of digest across all lanes
- hash_seed[0] = __shfl_sync(0xFFFFFFFF, state[0], h, PROGPOW_LANES);
- hash_seed[1] = __shfl_sync(0xFFFFFFFF, state[1], h, PROGPOW_LANES);
+ hash_seed[0] = __shfl_sync(0xFFFFFFFF, state8[0], h, PROGPOW_LANES);
+ hash_seed[1] = __shfl_sync(0xFFFFFFFF, state8[1], h, PROGPOW_LANES);
// initialize mix for all lanes using first
// two words from header_hash
@@ -217,6 +223,10 @@ progpow_search(
digest = digest_temp;
}
+ uint32_t state[25]; // Keccak's state
+
+ for (int i = 0; i < 8; i++)
+ state[i] = state8[i];
// Absorb phase for last round of keccak (256 bits)
// 1st initial 8 words of state are kept as carry-over from initial keccak
To have greater confidence that this is right, debugging output needs to be enabled (such as with #46) to make sure the computed hashes stay the same. Also, someone needs to implement the same changes in host-only code such as https://github.com/solardiz/c-progpow and in OpenCL, and make sure all 3 compute the same hashes. |
I'm testing on block 0 (epoch 0) |
Nice catch and confirm it voids performance impact.
Nevertheless, optimizations aside, I'd like a comment on effectiveness of change wrt the issue raised by Kikx. Thank you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the digest selection is correct.
Why not we fix this also? #49 |
@ethcoreorg cause it's outside the scope of this PR. |
@blondfrogs you're absolutely right. |
LLVM is smart enough to figure out which elements are zeroes and which are copied.
This IMHO doesnt solve the issue and keeps us in the situation of 0.9.3 as, again, we can pick an arbitrary seed (64 bits) value which produces a digest. brute force last keccak round to find a header (+extraNonce + nonce) producing a result matching target and with a seed output identical to the one chosen. We're still bruteforcing Keccak in search of a 64 bit value. With carry-over from intial keccak instead of header (pretty much like in ethash) we force the linear search of header+nonce which produces the same keccak squeeze on a 256 bit basis. |
I'd personally add more ... in ethash the first keccak absorb is constrained (10th word is 0x00000001 and 17th word is 0x80000000) and the second too (24th word is 0x00000001 and 37th word is 0x80000000). Constraining Keccak absorb (initial and last) further reduces (practically voids) the likelyhood of a succesful brute force attempt.. |
This is what I described as "variation of the attack that would still work would involve re-doing steps 3 and 4, then 5 again, etc. until both conditions are met at once" and explained why it's "clearly impractical". That said, I agree we'd better defeat even this possibility as well, which we'd do by increasing seed size as part of addressing #54.
I don't understand this. To me, your changes are equivalent to the trivial change I proposed. The variation of attack described above is possible for both. The only difference appears to be 2 vs. 1 Keccak computations per trial, which is an insignificant difference for our decision-making. |
... in search of an identical 256 bit value instead one of 64 bit. |
I don't see your proposal achieve that. Given your own description:
Here, let's assume |
Yes ... sorry ... please focus on code ... description is messed up. |
I see no relevant differences in code, and that part of the description actually looks good enough for the present discussion, despite of the simplifications. The part of description that was relevantly-wrong and unusable for the discussion was "extend the dependency from header_hash to mix too". |
This looks good to us. It would be good to get profiling and performance results for a HBM2 card and a GDDR5X card. |
0.88% L2 is ok? |
It's been a long time. Is there a reason this cannot be merged, or at least referenced as correct? I'm wanting to get a nearly-final EIP-1057 to the other core devs soon. @AndreaLanfranchi @ifdefelse @OhGodAGirl @lookfirst @solardiz @blondfrogs |
Thanks @ifdefelse ! |
Amend #51
Rationale
The problem highlighted by Kik revealed a vulnerable condition which allows the mining activity to be carried out bypassing, almost completely, the characteristics of memory-hardness embedded into the algo.
Must be evidenced that this issue could be exploited only in bundle with a custom node implementation and the current "public" mining infrastructure does not allow the threat vector to go through.
Nevertheless the concern is present and the code provided by Kik has proven real.
Kik's discovery relies on two basic assumptions :
As additional consideration a "custom node" must be modifed to accept a mining result from work consumer with a modified header_hash (see below)
The two above assumptions are valid for both ethash and progpow algorithms even if the latter, due to different width of "seed" is way more affected.
Both ethash and progpow share the same "macro" steps with a difference on step 3:
The logic concatenation of operations allows in ProgPoW (Kik's code) to:
1 skip the first step and directly set an arbitrary value for "seed"
2 obtain mix_hash (doing only one round of memory accesses)
3 linearly iterate through "extraData" increments and build a
new_header_hash
(Keccak256) starting from the header_hash sent by work provider4 Apply new_header_hash + seed + mix as input to last keccak round and compare first two words of digest result to target. If matching go ahead otherwise goto 3
5 We now have a new_header_hash and an "extraData" solving the algo. One thing left to do is to find a nonce which, combined with new_header in the initial Keccak round, produces a seed (and by consequence a mix_hash) which is equal to the one arbitrary set
6 To do so simply scan nonce range (2**64) and do as many Keccak rounds as possible to find the right one. Should range be exhausted goto 3
7 Eventually the miner will notify the "custom node" that the hashing has found a solution (nonce) changing header_hash due to extraData.
The difference in width for the seed and the different absorb phase for last keccak round makes the issue orders of magnitude more relevant in ProgPoW than in ethash but nevertheless the same pattern apply.
It goes without saying this method is highly impractical and economically unviable on ethash as of today ... but is there.
Basically in both algos we have :
This means also that setting an arbitrary seed we get
Thus making possible to brute force keccak while bypassing memory accesses
Purpose of this PR is to extend the dependency from header_hash to mix too so having
Thus making final hash depend on 256 bit digest from initial keccak which depends from header and nonce.